DAT-ATX-1 Capstone Project

Nikolaos Vergos, February 2016

nvergos@gmail.com

1. Data Preparation & Exploratory Analysis

Disclaimer: Many ideas for the Exploratory Data Analysis part, particularly the discussion of the health inspection scores histogram, the correlation between two different inspections, and the plot_restaurant_score_by_type Python function and the discussion of its resulting plots, have been adapted from the analysis of Dr. Tal Yarkoni.

0. Import libraries & packages


In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy import stats # For correlation coefficient calculation

1. Familiarize ourselves with our data


In [4]:
#Reading the dataset in a dataframe using Pandas
df = pd.read_csv('../data/Restaurant_Inspection_Scores.csv')  

#Print first observations
df.head()


Out[4]:
Restaurant Name Zip Code Inspection Date Score Address Facility ID Process Description
0 Corner Store #3958 78754 12/23/2015 100 12307 DESSAU RD\nAUSTIN, TX 78754\n(30.3872270... 11281491 Routine Inspection
1 Mr. Gatti's #118 78758 12/23/2015 94 2121 W PARMER LN\nAUSTIN, TX 78758\n(30.415649... 2801996 Routine Inspection
2 Texaco Braker Mart 78758 12/23/2015 83 2601 W BRAKER LN\nAUSTIN, TX 78758\n(30.390268... 10001712 Routine Inspection
3 Taco Bell 78753 12/23/2015 93 9504 N LAMAR BLVD\nAUSTIN, TX 78753\n(30.36480... 11066396 Routine Inspection
4 Subway 78758 12/23/2015 98 2501 W PARMER LN\nAUSTIN, TX 78758\n(30.418236... 10385802 Routine Inspection

In [5]:
len(df)


Out[5]:
22879

In [6]:
# Drop duplicate entries
df = df.drop_duplicates()

In [7]:
print "There are {0} entries (rows) in this dataframe".format(len(df)) # Initial assessment


There are 22875 entries (rows) in this dataframe

Almost all column names contain spaces between the words. Even though this looks nice in the dataframe, it can be quite problematic further down the road. Therefore, let us implement an easy fix that will save us a lot of pain later on:


In [8]:
df = df.rename(columns={'Restaurant Name': 'Restaurant_Name', 'Zip Code' : 'Zip_Code',
                        'Inspection Date' : 'Inspection_Date', 'Facility ID' : 'Facility_ID',
                       'Process Description' : 'Process_Description'})

The "Inspection_Date" column looks like a humanly readable series of dates, however it is encoded as (useless) strings. Fortunately we can easily tell pandas to see this column as a datetime format for easier and better reference:


In [9]:
from datetime import datetime

df['Inspection_Date'] = pd.to_datetime(df['Inspection_Date'])

Let us also create an auxiliary DataFrame where each establishment shows up once (even though there are 5-6 inspection instances per establishment), since we need to know how many individual establishments are inspected. Side note: I am using the word "establishment" instead of "restaurant" because the Austin/Travis County Health and Human Services Department conducts the permitting and inspection of more than 4,000 food establishments in Austin, several local municipalities and rural Travis County. This includes any place serving food: restaurants, school cafeterias, grocery and convenience stores, etc. We will deal with this later, because we have chosen to focus our study on restaurants (loosely defined as places a potential patron chooses to go to dine in or carry out).


In [10]:
establishments = df.groupby('Facility_ID')

In [11]:
# Print a quick summary: number of inspections, number of establishments, column dtypes
print "Number of inspections:", len(df)
print "Number of individual establishments inspected:", len(establishments)
print "\nColumns:\n", df.dtypes


Number of inspections: 22875
Number of individual establishments inspected: 4774

Columns:
Restaurant_Name                object
Zip_Code                        int64
Inspection_Date        datetime64[ns]
Score                           int64
Address                        object
Facility_ID                     int64
Process_Description            object
dtype: object

We have nearly 23,000 inspections from over 4,700 distinct establishments. For each inspection, we have (among other things) a health inspection score, an establishment name, an inspection date, and a description of the process (i.e., whether the inspection was routine, or reflected a follow-up visit triggered by an establishment failing an earlier inspection).
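Before looking at the scores, it is worth a quick check of how the inspection types break down. This is only a sketch relying on the Process_Description labels present in the raw file:

# Breakdown of inspection types (routine vs. follow-up visits)
print(df['Process_Description'].value_counts())

# Fraction of inspections that are routine
print((df['Process_Description'] == 'Routine Inspection').mean())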

Let us start our exploratory data analysis with a simple frequency table of the health inspection scores:


In [12]:
print df['Score'].value_counts().sort_index() # Frequency Table


36        1
45        1
46        1
48        1
50        3
51        2
53        5
54        1
56        6
57        7
58        6
59       14
60       11
61       10
62        9
63       16
64       14
65       19
66       14
67       12
68       23
69       15
70      125
71      115
72      129
73      140
74      150
75      158
76      175
77      203
78      261
79      228
80      305
81      397
82      419
83      409
84      499
85      618
86      642
87      620
88      701
89      859
90     1137
91      802
92      897
93     1645
94     1438
95      475
96     1274
97     3099
98       76
99       76
100    4612
Name: Score, dtype: int64

The first striking result is that, assuming the health inspectors are doing their job meticulously, Austin is quite a sanitary city for diners. Recall that a failing score is below 70, and according to the ad hoc letter grade scheme we are going to implement, the vast majority of inspections leads to a letter grade of "A", which we would say symbolizes a "pristine" establishment. We are going to perform more thorough research in the Machine Learning part of this project, but this dataset can already yield some quite interesting visualizations:
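The exact letter-grade scheme is implemented later in the project; as a minimal sketch, assuming the conventional cutoffs of 90/80/70 for A/B/C and F below 70, the breakdown would look like this:

# Ad hoc letter grades (assumed cutoffs: A >= 90, B >= 80, C >= 70, F < 70)
grades = pd.cut(df['Score'], bins=[0, 69, 79, 89, 100], labels=['F', 'C', 'B', 'A'])
print(grades.value_counts())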


In [13]:
# Function to plot a histogram based on the frequency table above:

def plot_histogram():
    df['Score'].hist(bins=50)
    plt.xlabel('Health Inspection Score')
    plt.ylabel('Count')
    plt.savefig('histogram.png')

plot_histogram()



In [14]:
# A somewhat prettier histogram with seaborn; density estimate (rather than counts) on the y-axis
sns.distplot(df['Score']);


We see that the distribution of our data is heavily skewed to the left (i.e. bunched up toward the right with a "tail" stretching toward the left). Scores of 98 or 99 are comparatively rare--presumably because the smallest possible violation results in a loss of at least 3 points. A similar explanation probably accounts for the dip around 95, i.e., it's relatively unusual to get docked exactly 5 points for a single infraction.
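A quick numeric check of this skew; a negative skewness statistic confirms the long left tail:

# Skewness of the score distribution (negative = left-skewed)
print(stats.skew(df['Score']))
print(df['Score'].describe())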

More curious, though, is what happens at the low end of the distribution--specifically around a score of 70, where there appears to be a relatively sharp discontinuity. To see it better, we can zoom in:


In [15]:
plot_histogram()
plt.xlim([50,80]);
plt.ylim([0,300]);


There is a sharp break at 70. Recall that 70 is the minimum score required to pass inspection. If a restaurant gets a score below that, it fails inspection, which presumably triggers all kinds of unpleasant things - e.g., more paperwork for the inspector, bad PR for the restaurant, and a follow-up (non-routine) inspection one or two weeks later. So one possibility here is that there's some pressure on inspectors - whether explicit or implicit - to avoid giving restaurants failing scores. This could possibly explain the fact that 99% of the entries in our data set have passing scores, as we are going to see later. Unfortunately this great imbalance between passing and failing restaurants will significantly hinder our attempts to build predictive models with confidence.
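We can quantify that imbalance right away (a quick sketch using the passing threshold of 70; we will return to it in the modeling part):

# Fraction of inspections with a passing score (>= 70)
passing = (df['Score'] >= 70).mean()
print("Passing rate: %.1f%%" % (passing * 100))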

Dr. Tal Yarkoni offers some more interesting explanations: For example, it's possible that the health inspection guidelines explicitly allow inspectors to let restaurants off with a warning. Or perhaps the scoring algorithm isn't just a linear sum of all the violations observed, and there has to be some particularly egregious health code violation in order for a restaurant to receive a score below 70. Or, inspectors may be pragmatically working around weaknesses in the code - e.g., a restaurant may theoretically be able to fail inspection because of a large number of minor infractions, even if no single infraction presents any meaningful health risk to customers.

Still, absent an explanation, there's at least some potential cause for concern here, and it's certainly consistent with the data that health inspectors might be systematically grading restaurants more leniently than they should be. Are the inspectors doing their job meticulously? Is Austin dining really so sanitary overall?

This is a recurring theme of this study, as we are going to see.

2. Data Preparation

On to the immensely boring, yet essential, part: data cleaning. I will start by renaming some of our restaurants so they'll avoid the axe when I remove everything that isn't a restaurant. As I have mentioned, the data set unfortunately includes every establishment that needs a health inspection, and I'm only interested in places where people actually choose to dine in or carry out, not schools, churches, jails (!), etc.


In [16]:
# Unfortunately the words "Market", "Grocery", "Church" correspond to places
# like convenience stores and grocery stores.

# I want to make sure that some restaurants with those words in their name will go through my filter.

df.loc[(df['Restaurant_Name'].str.contains('Whole')) , 'Restaurant_Name'] = "Whole Foods"
df.loc[(df['Restaurant_Name'].str.contains('Central Market')) , 'Restaurant_Name'] = "Central Mkt"
df.loc[(df['Restaurant_Name'].str.contains('Boston Market')) , 'Restaurant_Name'] = "Boston Mkt"
df.loc[(df['Restaurant_Name'].str.contains('Mandola')) , 'Restaurant_Name'] = "Mandola's"
df.loc[(df['Restaurant_Name'].str.contains('Royal Blue')) , 'Restaurant_Name'] = "Royal Blue"
df.loc[(df['Restaurant_Name'].str.contains('Rudy')) , 'Restaurant_Name'] = "Rudy's"
df.loc[(df['Restaurant_Name'].str.contains('Fit Foods')) , 'Restaurant_Name'] = "My Ft Foods"
df.loc[(df['Restaurant_Name'].str.contains("Church's Chicken")) , 'Restaurant_Name'] = "Chrch Chicken"
df.loc[(df['Restaurant_Name'].str.contains("Schlotzsky's")) , 'Restaurant_Name'] = "Schlotzsky's"

In [17]:
len(df)


Out[17]:
22875

Let me now filter out everything that doesn't follow my loose definition of a restaurant:


In [18]:
# Filter out schools, day cares, institutional kitchens, convenience and grocery stores,
# gyms, churches and gas stations - anything that doesn't fit my loose definition of a restaurant.

non_restaurant_keywords = ['School', 'Elementary', 'Care', 'Middle', 'Cafeteria', 'Jail', 'ISD', 'Academy',
                           'Mart', 'Gas', 'Convenience', '7-Eleven', 'HEB', 'Station', 'Randall', 'Target', 'Flea',
                           'Gym', 'Fitness', 'Fit',
                           'Church', 'Dollar', 'Store', 'Texaco']

for keyword in non_restaurant_keywords:
    df = df[df['Restaurant_Name'].str.contains(keyword) == False]

In [19]:
print len(df)


18292

In [20]:
restaurants = df.groupby('Facility_ID') # We can switch from "establishments" to "restaurants" after the purge
len(restaurants)


Out[20]:
3934

We got rid of 4583 DataFrame rows corresponding to inspections of 840 establishments. Not bad at all!

3. Data Exploration & Visualizations

An important question is whether scores are stable over time within individual restaurants. Since the vast majority of restaurants have been inspected several times (mostly twice a year) over the last three years, we can directly answer this question by computing the test-retest Pearson correlation metric across multiple inspections.

A quick, though somewhat crude, way to do this is to randomly select two inspections for every restaurant with multiple inspections and compute the correlation. The resulting scatter plot looks like this:


In [22]:
# Cite: Analysis by Dr. Tal Yarkoni

# Filter for restaurants with > 1 inspection

two_or_more = restaurants.filter(lambda x: x.shape[0] > 1)
print "Number of restaurants with two or more inspections:", two_or_more['Facility_ID'].nunique()


# Shuffle order and select a random pair for each restaurant

two_or_more = two_or_more.reindex(np.random.permutation(two_or_more.index))
random_pairs = two_or_more.groupby('Facility_ID', as_index=False).head(2).sort('Facility_ID')
random_pairs['number'] = np.tile(np.array([1,2]), len(random_pairs)/2)
pairs = random_pairs.pivot(index='Facility_ID', columns='number', values='Score')
r, p = stats.pearsonr(*pairs.values.T)

# Plot the relationship

f, ax = plt.subplots(figsize=(6, 6))
sns.regplot(pairs[1], pairs[2], x_jitter=2, y_jitter=2,
            color="#334477", scatter_kws={"alpha": .05, "s":100})

ax.text(62, 72, "r = %.2f" % r, fontsize=14)
ax.set(xlim=(60, 105), ylim=(60, 105),
        xlabel='Score for Inspection 1',
        ylabel='Score for Inspection 2');


Number of restaurants with two or more inspections: 3682

The test-retest correlation of 0.47 indicates that there is indeed a fair amount of consistency to the scores.

That's reassuring, in that a very low correlation might lead us to worry that the health inspection process itself is unreliable, since it's hard to imagine that there aren't real differences in how mindful different proprietors are of the health code - not to mention the fact that some kinds of establishments are likely to be at much higher risk of code violations than others by virtue of the kind of food they serve.

In simpler words, if a restaurant does well, it will probably keep doing well. If it barely passes inspection one time, it probably won't improve dramatically by the next one. However, there is a quite interesting phenomenon at the "outliers" of the scatterplot above: there seem to be some restaurants that obtained "pristine" scores during inspection 1 and then dropped by 20 points or more during inspection 2, and vice versa. The vast majority of establishments show a fair degree of consistency, though.
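To put a rough number on those outliers, we can count how many of the randomly selected pairs differ by 20 points or more (a quick sketch using the pairs DataFrame built above; keep in mind the jitter in the plot exaggerates some of the spread):

# Restaurants whose two randomly chosen inspection scores differ by 20+ points
big_swings = (pairs[1] - pairs[2]).abs() >= 20
print("Restaurants with a 20+ point swing: %d of %d" % (big_swings.sum(), len(pairs)))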

The following part could have been much easier and much more exciting had the City of Austin provided us with a richer dataset including extra columns such as type of cuisine or some classification of the violations that caused point deductions. A significantly richer analysis would have been possible in that case.

We can still implement an ad hoc workaround to break out restaurants by the kind of food they serve, provided their names are indicative of the cuisine. We are going to categorize restaurant type much more crudely, taking advantage of the fact that many restaurants use their name to announce the kind of food they serve - witness, for example, "Carino's Italian", "Asia Chinese Restaurant", and "Mi Casa Mexican Restaurant". Dr. Yarkoni did an amazing job here:

By grouping together restaurants with the same stem of a word in their names, we can generate the following plot:


In [23]:
# Create a unique, human-readable ID column: concatenate restaurant name and Facility ID

df['string_id'] = [x.lower() for x in (df['Restaurant_Name'] + '_' + df['Facility_ID'].astype(str))]

In [24]:
def plot_restaurant_score_by_type(types):
    """ Takes a list of strings, each defining a group of restaurants that contain that 
    particular string within their title. """
    
    means, sems, labels = [], [], []
    n_types = len(types)
    
    # Generate means, SEMs (standard errors), and labels
    for c in types:
        stores = df[df['string_id'].str.contains(c)]
        unique_stores = stores.groupby('string_id')['Score'].mean()
        n_stores = len(unique_stores)
        n_inspections = len(stores)
        std = unique_stores.std()
        means.append(unique_stores.mean())
        sems.append(stats.sem(unique_stores)) # sem: standard error of the mean
        labels.append('"' + c + '" (%d, %d)' % (n_stores, n_inspections))
    
    # Order by descending score
    plot_data = pd.DataFrame({'mean':means, 'sem':sems}, index=labels)
    plot_data = plot_data.sort('mean', ascending=True)

    # Plot
    pal = sns.color_palette("husl", len(plot_data))
    f, ax = plt.subplots(figsize=(4, 8))
    for y, (label, (mean, sem)) in enumerate(plot_data.iterrows()):
        ax.errorbar(mean, y, xerr=sem, fmt='o', color=pal[y])
    ax.set_ylim([-1, n_types])
    ax.set_yticks(np.arange(len(plot_data)))
    ax.set_yticklabels(plot_data.index, rotation=0, horizontalalignment='right')
    ax.set_xlabel('Health inspection score', fontsize=14)
    ax.set_ylabel('Restaurant name contains...', fontsize=14)
    
types = ['chin', 'mexic', 'indian', 'thai', 'vietnam|pho', 'italia', 'taco|taqu', 'sushi|jap', 
         'pizz', 'korea', 'burger', 'donut|doughnut', 'coffee', 'bakery', 'ice cream', 
         'chicken', 'buffet', 'grill', 'bbq|barbe', 'steak', 'greek', 'beer']

In [25]:
plot_restaurant_score_by_type(types)


The values in parentheses on the y-axis labels denote the number of unique restaurants and total inspections in each group, respectively. The error bars denote ±1 standard error of the mean (SEM).
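If a 95% confidence interval is preferred instead, the SEM can be scaled by the usual normal-approximation factor of about 1.96. A minimal sketch for a single group, reusing the string_id column built above:

# SEM vs. approximate 95% CI half-width for "thai"-containing restaurants
thai_scores = df[df['string_id'].str.contains('thai')].groupby('string_id')['Score'].mean()
sem = stats.sem(thai_scores)
print("SEM: %.2f, approx. 95%% CI half-width: %.2f" % (sem, 1.96 * sem))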

The plot largely corroborates what we probably already knew, with perhaps a couple of minor surprises. It's not surprising that establishments primarily serving coffee, ice cream, or pastries tend to do very well on health inspections. The same applies to most burger and pizza places, and I believe this has to do with strict health protocols universally implemented by the national chains that dominate those categories. As we will see in the next plot, restaurants belonging to major chains tend to do exceptionally well in health inspections, with very little variance.

At the other end of the spectrum, the 6 restaurants with the word "buffet" in their name do quite... abysmally. Their average score of 79 is pretty close to the magical failure number of 70. Across 34 different inspections, no "buffet"-containing restaurant in Austin has managed to obtain a score higher than 90 in the past 3 years. Of course, this conclusion only applies to buffet restaurants that have the word "buffet" in their name, but I believe those comprise the vast majority of all-you-can-eat type eateries, even though I hope I am wrong, because to be honest sometimes buffets can be quite yummy...
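That claim is easy to double-check directly from the data; the sketch below simply looks at the maximum score recorded for any "buffet"-containing name:

# Highest score earned by any "buffet"-containing restaurant in this dataset
buffets = df[df['string_id'].str.contains('buffet')]
print(buffets['Score'].max())
print(buffets.groupby('string_id')['Score'].mean())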

Also not happy winners in this analysis: restaurants serving ethnic food, with the possible exception of Indian restaurants (though the variance for Indian restaurants is high). Asian restaurants do particularly poorly; for example, "thai"-containing restaurants obtain a score of 83, on average.

It seems that nearly every category ranking below "bbq" in the plot could be classified as "ethnic", and this leads to an interesting question, which we cannot really answer given the data we have: are ethnic restaurants bound to do worse than their "American" counterparts, or is there some kind of inspector bias against "different" cuisines? Of course, diners can form their individual opinions freely and I do not wish to nudge the discussion toward either side, but this trend is really striking to me.

On a positive note, though, we can see that the 74 "burger"-containing establishments in this list - you know, the ones where cooks spend much of their day wading around in buckets of raw meat - tend to have very good scores (perhaps because of the perception that health inspectors are gunning for them, I don't know). So, given a choice between Thai food and a burger, health code-wise, you're arguably better off with the burger. Of course, in the grand scheme of things, these are not terribly large differences, and the vast majority of the time, you're going to be just fine after eating pretty much anywhere (including even those restaurants that fail inspection).

Various discussions with colleagues have led me to the conclusion that this is most probably happening because national chains can't and won't tolerate PR scandals associated with the sanitary quality of their food; therefore their food preparation protocols are quite strict, as I have mentioned above.

Let's eat clean tonight! - Pizza Hut, right???

Since it's easy to bin restaurants by title, something else we can do is look at the performance of major local and national food chains. Here's what that looks like, for selected chains with major representation in Austin:


In [26]:
chains = ['starbucks', 'mcdonald', 'subway', 'popeye', 'whataburger', 'domino', 
         'jamba', 'schlotzsky', 'taco shack', 'burger king', 'wendy', 
         'panda', 'chick-fil', 'pizza hut', 'papa john', 'chipotle', 'pei wei', 'torchy', 'tacodeli']

In [27]:
plot_restaurant_score_by_type(chains)


Good news! It's pretty much impossible to get sick at Jamba Juice or Starbucks. The large pizza chains all do very well. Also, if you're looking for health code-friendly Asian food, Panda Express is your new best friend. If your primary basis for deciding between drive-thrus is tiny differences in health inspection score, you may want to opt for local hero Whataburger, Wendy's or Burger King over McDonald's. Otherwise, there's nothing terribly interesting here, other than the surprisingly low (to me) scores of Pei Wei, which I have to admit I used to hold in higher esteem. An average score of 88 isn't something to be very disappointed about, of course, but it seems that they, despite being a national chain, are slightly more lax when it comes to their food preparation protocols than their more "fast-food" counterpart, Panda Express.

The reader is strongly encouraged to visit Tal Yarkoni's notebook for further exploratory data analysis on the City of Austin Restaurant Health Inspection dataset.

4. Enriching our dataset by adding numerics

As we have seen, the only quantitative variable in our data set is each restaurant's score. Even though the given data set leads to a fairly thorough exploration and raises some quite interesting questions, as we have seen so far, my opinion is that our output would be much richer if we had more information. The steps I took to address this are:

  • I unilaterally decided to trim down the area of interest. Health inspections are conducted throughout Travis County, which covers a large area and includes the City of Austin, neighboring communities like Pflugerville, as well as some far-flung suburbs like Lago Vista or Manor. I spent some time over a map and decided to focus the rest of my analysis on my loose definition of the Austin City Limits. Roughly, this translates to eliminating all rows from my DataFrames that belong to the suburban Zip Codes.
  • I did some more research on the 36 Zip Codes remaining after the purge above, and using U.S. Census data from 2014 I incorporated three more columns into my DataFrame: Population, Median Income and Home Ownership percentage for each one of the Austin City Limits Zip Codes.

In [28]:
# Let's see how many Zip Codes we start with:

print len(pd.unique(df.Zip_Code.ravel()))


56

Those 56 Zip Codes cover the entire area of Travis County where the Austin/Travis County Health and Human Services Department conducts the inspections. After spending some time studying a detailed map, I decided to only keep a subset of those 56 Zip Codes from now on, even though this means I am trimming down my data set. I am interested in Austin proper:


In [29]:
# Focus on the main part of the city
# The geographical division is mine

SE_zip = (78744, 78747, 78719, 78741)
Central_zip = (78701, 78702, 78703, 78705, 78721, 78723, 78712, 78751, 78756)
NE_zip = (78752, 78753, 78754)
NW_zip = (78757, 78758, 78727, 78722, 78729, 78717, 78750, 78759, 78726, 78730, 78731, 78732)
SW_zip = (78704, 78745, 78748, 78739, 78749, 78735, 78733, 78746)

ACL_zip = SE_zip + Central_zip + NE_zip + NW_zip + SW_zip
len(ACL_zip)


Out[29]:
36

These 36 Zip Codes, covering a more or less contiguous area, are what we will focus on from now on.


In [30]:
# Create a new DataFrame only including Austin City Limits restaurants:

ACL = df[df.Zip_Code.isin(ACL_zip)]
ACL.describe()


Out[30]:
Zip_Code Score Facility_ID
count 16113.000000 16113.000000 16113.000000
mean 78732.831565 90.556383 7290115.054118
std 21.670047 8.003905 3846991.045580
min 78701.000000 36.000000 2800004.000000
25% 78705.000000 86.000000 2801979.000000
50% 78741.000000 92.000000 10091552.000000
75% 78752.000000 97.000000 10662646.000000
max 78759.000000 100.000000 11386831.000000

Let us now introduce the U.S. 2014 Census numerics data:


In [31]:
raw_data = {
        'Zip_Code': [78701, 78702, 78703, 78705, 78721, 78723, 78712, 78751, 78756, 78744, 78747, 78719, 78741,\
                    78752, 78753, 78754, 78704, 78745, 78748, 78739, 78749, 78735, 78733, 78746, 78757, 78758,\
                    78727, 78722, 78729, 78717, 78750, 78759, 78726, 78730, 78731, 78732],
        'Med_Income': [35757, 23348, 54591, 14740, 26646, 34242, np.nan, 29779, 36978, 38256, 60861, np.nan, 25369,\
                      30207, 38206, 51810, 35733, 43458, 57710, 102707, 68244, 75204, 102239, 100571,\
                      45090, 42398, 62648, 35794, 59497, 87290, 78428, 61284, 89891, 128524, 62404, 103951],
        'Population': [3780, 22498, 19522, 26972, 10192, 30196, np.nan, 13829, 7253, 34028, 4908, np.nan, 40678,\
                      17978, 43788, 5517, 43343, 53136, 25138, 8708, 28420, 9563, 9144, 25768,\
                      21434, 42977, 22332, 6264, 24539, 8209, 23563, 40327, 6547, 4848, 24068, 3804],
        'Home_Ownership': [0.377, 0.467, 0.515, 0.11, 0.576, 0.441, np.nan, 0.259, 0.395, 0.536, 0.89, np.nan, 0.146,\
                          0.266, 0.407, 0.11, 0.302, 0.09, 0.739, 0.962, 0.36, 0.564, 0.901, 0.71,\
                          0.566, 0.309, 0.573, 0.468, 0.516, 0.793, 0.717, 0.451, 0.7, 0.74, 0.593, 0.97]}

df_numerics = pd.DataFrame(raw_data, columns = ['Zip_Code','Med_Income', 'Population', 'Home_Ownership'])
df_numerics


Out[31]:
Zip_Code Med_Income Population Home_Ownership
0 78701 35757 3780 0.377
1 78702 23348 22498 0.467
2 78703 54591 19522 0.515
3 78705 14740 26972 0.110
4 78721 26646 10192 0.576
5 78723 34242 30196 0.441
6 78712 NaN NaN NaN
7 78751 29779 13829 0.259
8 78756 36978 7253 0.395
9 78744 38256 34028 0.536
10 78747 60861 4908 0.890
11 78719 NaN NaN NaN
12 78741 25369 40678 0.146
13 78752 30207 17978 0.266
14 78753 38206 43788 0.407
15 78754 51810 5517 0.110
16 78704 35733 43343 0.302
17 78745 43458 53136 0.090
18 78748 57710 25138 0.739
19 78739 102707 8708 0.962
20 78749 68244 28420 0.360
21 78735 75204 9563 0.564
22 78733 102239 9144 0.901
23 78746 100571 25768 0.710
24 78757 45090 21434 0.566
25 78758 42398 42977 0.309
26 78727 62648 22332 0.573
27 78722 35794 6264 0.468
28 78729 59497 24539 0.516
29 78717 87290 8209 0.793
30 78750 78428 23563 0.717
31 78759 61284 40327 0.451
32 78726 89891 6547 0.700
33 78730 128524 4848 0.740
34 78731 62404 24068 0.593
35 78732 103951 3804 0.970

78712 is the special Zip Code for the University of Texas Campus and 78719 corresponds to Austin - Bergstrom International Airport.

Let us now merge the two DataFrames, ACL and df_numerics into a common one:


In [32]:
merged = pd.merge(ACL, df_numerics, on='Zip_Code')
merged.head()


Out[32]:
Restaurant_Name Zip_Code Inspection_Date Score Address Facility_ID Process_Description string_id Med_Income Population Home_Ownership
0 Mr. Gatti's #118 78758 2015-12-23 94 2121 W PARMER LN\nAUSTIN, TX 78758\n(30.415649... 2801996 Routine Inspection mr. gatti's #118_2801996 42398 42977 0.309
1 Subway 78758 2015-12-23 98 2501 W PARMER LN\nAUSTIN, TX 78758\n(30.418236... 10385802 Routine Inspection subway_10385802 42398 42977 0.309
2 Baskin Robbins 78758 2015-12-23 99 12407 N MOPAC EXPY\nAUSTIN, TX 78758\n(30.4174... 2802274 Routine Inspection baskin robbins_2802274 42398 42977 0.309
3 JR's Tacos 78758 2015-12-22 91 1921 CEDAR BEND DR\nAUSTIN, TX 78758\n(30.4083... 10964220 Routine Inspection jr's tacos_10964220 42398 42977 0.309
4 Econo Lodge 78758 2015-12-22 91 9100 BURNET RD\nAUSTIN, TX 78758\n(30.37479048... 10778546 Routine Inspection econo lodge_10778546 42398 42977 0.309
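Recall that 78712 (the University of Texas campus) and 78719 (the airport) have no census figures, so any inspections in those Zip Codes carry NaN in the three new columns. A quick sketch of how many rows are affected, in case we need to drop or impute them before modeling:

# Inspections whose Zip Code has no census data (78712 and 78719)
missing_census = merged['Med_Income'].isnull()
print("Rows with missing census data: %d of %d" % (missing_census.sum(), len(merged)))
print(merged.loc[missing_census, 'Zip_Code'].value_counts())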

5. DataFrame TLC and a very interesting visualization

Our "main" DataFrame needs some TLC. We also have to exploit the "Address" column further by creating neat columns with street address, city information and coordinates for each row - even though we are focusing in the contiguous Austin area, we are still having the areas of Rollingwood, West Lake Hills, Sunset Valley and Bee Cave to take into account (and get some interesting associations between Restaurant Health Inspection Score and Income...)


In [33]:
# Creation of a DataFrame off of the constituent components of the 'Address' Column:

merged_location = merged['Address'].apply(lambda x: pd.Series(x.split('\n')))

In [34]:
merged_location.head()


Out[34]:
0 1 2
0 2121 W PARMER LN AUSTIN, TX 78758 (30.4156492668216, -97.6975693977174)
1 2501 W PARMER LN AUSTIN, TX 78758 (30.4182363972979, -97.7022420545932)
2 12407 N MOPAC EXPY AUSTIN, TX 78758 (30.4174621728585, -97.7028442208035)
3 1921 CEDAR BEND DR AUSTIN, TX 78758 (30.408322328997, -97.6982433226702)
4 9100 BURNET RD AUSTIN, TX 78758 (30.3747904817031, -97.7264170496741)

Let's reformat the first two columns for geocoding:


In [35]:
geocode = merged_location[0] + ', ' + merged_location[1]

In [36]:
geocode.head()


Out[36]:
0      2121 W PARMER LN, AUSTIN, TX 78758
1      2501 W PARMER LN, AUSTIN, TX 78758
2    12407 N MOPAC EXPY, AUSTIN, TX 78758
3    1921 CEDAR BEND DR, AUSTIN, TX 78758
4        9100 BURNET RD, AUSTIN, TX 78758
dtype: object

In [37]:
# What about the coordinates column?

merged_location[2].head(2)


Out[37]:
0    (30.4156492668216, -97.6975693977174)
1    (30.4182363972979, -97.7022420545932)
Name: 2, dtype: object

Each element of column [2] of the merged_location DataFrame is a parenthesized pair of coordinates representing latitude and longitude. Those seem like an interesting addition to our dataframe, so we'll do a little more work to add them as individual columns:


In [38]:
# Get rid of the parentheses:

coords = pd.Series(merged_location[2])

stripped_coords = []
for coord in coords:
    coord = coord.strip('()')
    stripped_coords.append(coord)

In [39]:
stripped_coords[1]


Out[39]:
'30.4182363972979, -97.7022420545932'

In [40]:
merged['Coordinates'] = pd.Series(stripped_coords)

In [41]:
merged['Latitude'] = 0
merged['Longitude'] = 0

Following Chris Albon's excellent code snippet:


In [42]:
# Create two lists for the loop results to be placed
lat = []
lon = []

# For each row in the Coordinates column,
for row in merged['Coordinates']:
    # Try to,
    try:
        # Split the row by comma, convert to float, and append
        # everything before the comma to lat
        lat.append(float(row.split(',')[0]))
        # Split the row by comma, convert to float, and append
        # everything after the comma to lon
        lon.append(float(row.split(',')[1]))
    # But if you get an error
    except:
        # append a missing value to lat
        lat.append(np.NaN)
        # append a missing value to lon
        lon.append(np.NaN)

# Create two new columns from lat and lon
merged['Latitude'] = lat
merged['Longitude'] = lon
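As an aside, the same result can be obtained without an explicit loop. A sketch of a vectorized alternative, assuming every Coordinates entry is a well-formed "lat, lon" string (the loop above is more forgiving of malformed rows):

# Vectorized alternative: split on the comma and cast both parts to float
latlon = merged['Coordinates'].str.split(',', expand=True).astype(float)
merged['Latitude'] = latlon[0]
merged['Longitude'] = latlon[1]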

In [43]:
merged.head()


Out[43]:
Restaurant_Name Zip_Code Inspection_Date Score Address Facility_ID Process_Description string_id Med_Income Population Home_Ownership Coordinates Latitude Longitude
0 Mr. Gatti's #118 78758 2015-12-23 94 2121 W PARMER LN\nAUSTIN, TX 78758\n(30.415649... 2801996 Routine Inspection mr. gatti's #118_2801996 42398 42977 0.309 30.4156492668216, -97.6975693977174 30.415649 -97.697569
1 Subway 78758 2015-12-23 98 2501 W PARMER LN\nAUSTIN, TX 78758\n(30.418236... 10385802 Routine Inspection subway_10385802 42398 42977 0.309 30.4182363972979, -97.7022420545932 30.418236 -97.702242
2 Baskin Robbins 78758 2015-12-23 99 12407 N MOPAC EXPY\nAUSTIN, TX 78758\n(30.4174... 2802274 Routine Inspection baskin robbins_2802274 42398 42977 0.309 30.4174621728585, -97.7028442208035 30.417462 -97.702844
3 JR's Tacos 78758 2015-12-22 91 1921 CEDAR BEND DR\nAUSTIN, TX 78758\n(30.4083... 10964220 Routine Inspection jr's tacos_10964220 42398 42977 0.309 30.408322328997, -97.6982433226702 30.408322 -97.698243
4 Econo Lodge 78758 2015-12-22 91 9100 BURNET RD\nAUSTIN, TX 78758\n(30.37479048... 10778546 Routine Inspection econo lodge_10778546 42398 42977 0.309 30.3747904817031, -97.7264170496741 30.374790 -97.726417

In [44]:
# Create a DataFrame off of column [1] of merged_location: City and State-Zip Code:

cities = merged_location[1].apply(lambda x: pd.Series(x.split(',')))

In [45]:
cities.head()


Out[45]:
0 1
0 AUSTIN TX 78758
1 AUSTIN TX 78758
2 AUSTIN TX 78758
3 AUSTIN TX 78758
4 AUSTIN TX 78758

In [46]:
# What are the unique cities in our DataFrame?

pd.unique(cities[0].ravel())


Out[46]:
array(['AUSTIN', 'ROLLINGWOOD', 'WEST LAKE HILLS', 'SUNSET VALLEY',
       'BEE CAVE'], dtype=object)

In [47]:
# All our restaurants are in Texas, and we have already dealt with Zip Codes. Ergo we don't need column [1]:

del cities[1]

In [48]:
cities.head()


Out[48]:
0
0 AUSTIN
1 AUSTIN
2 AUSTIN
3 AUSTIN
4 AUSTIN

In [49]:
# Let us add this neatly divided information into our main DataFrame:

merged['Street'] = merged_location[0]
merged['City'] = cities[0]
merged['Geocode'] = geocode

In [50]:
merged.head(10)


Out[50]:
Restaurant_Name Zip_Code Inspection_Date Score Address Facility_ID Process_Description string_id Med_Income Population Home_Ownership Coordinates Latitude Longitude Street City Geocode
0 Mr. Gatti's #118 78758 2015-12-23 94 2121 W PARMER LN\nAUSTIN, TX 78758\n(30.415649... 2801996 Routine Inspection mr. gatti's #118_2801996 42398 42977 0.309 30.4156492668216, -97.6975693977174 30.415649 -97.697569 2121 W PARMER LN AUSTIN 2121 W PARMER LN, AUSTIN, TX 78758
1 Subway 78758 2015-12-23 98 2501 W PARMER LN\nAUSTIN, TX 78758\n(30.418236... 10385802 Routine Inspection subway_10385802 42398 42977 0.309 30.4182363972979, -97.7022420545932 30.418236 -97.702242 2501 W PARMER LN AUSTIN 2501 W PARMER LN, AUSTIN, TX 78758
2 Baskin Robbins 78758 2015-12-23 99 12407 N MOPAC EXPY\nAUSTIN, TX 78758\n(30.4174... 2802274 Routine Inspection baskin robbins_2802274 42398 42977 0.309 30.4174621728585, -97.7028442208035 30.417462 -97.702844 12407 N MOPAC EXPY AUSTIN 12407 N MOPAC EXPY, AUSTIN, TX 78758
3 JR's Tacos 78758 2015-12-22 91 1921 CEDAR BEND DR\nAUSTIN, TX 78758\n(30.4083... 10964220 Routine Inspection jr's tacos_10964220 42398 42977 0.309 30.408322328997, -97.6982433226702 30.408322 -97.698243 1921 CEDAR BEND DR AUSTIN 1921 CEDAR BEND DR, AUSTIN, TX 78758
4 Econo Lodge 78758 2015-12-22 91 9100 BURNET RD\nAUSTIN, TX 78758\n(30.37479048... 10778546 Routine Inspection econo lodge_10778546 42398 42977 0.309 30.3747904817031, -97.7264170496741 30.374790 -97.726417 9100 BURNET RD AUSTIN 9100 BURNET RD, AUSTIN, TX 78758
5 Shahi Food Market 78758 2015-12-22 80 12410 N LAMAR BLVD\nAUSTIN, TX 78758\n(30.4066... 2803821 Routine Inspection shahi food market_2803821 42398 42977 0.309 30.4066900253572, -97.6788692603716 30.406690 -97.678869 12410 N LAMAR BLVD AUSTIN 12410 N LAMAR BLVD, AUSTIN, TX 78758
6 K1 Speed 78758 2015-12-22 87 2500 MC HALE CT\nAUSTIN, TX 78758\n(30.3851472... 10771103 Routine Inspection k1 speed_10771103 42398 42977 0.309 30.3851472596582, -97.7212246080011 30.385147 -97.721225 2500 MC HALE CT AUSTIN 2500 MC HALE CT, AUSTIN, TX 78758
7 Jasper's 78758 2015-12-22 83 11506 CENTURY OAKS TER\nAUSTIN, TX 78758\n(30.... 10006802 Routine Inspection jasper's_10006802 42398 42977 0.309 30.4025739974381, -97.7258935626451 30.402574 -97.725894 11506 CENTURY OAKS TER AUSTIN 11506 CENTURY OAKS TER, AUSTIN, TX 78758
8 Papa John's Pizza #930 78758 2015-12-22 84 1925 RUTLAND DR\nAUSTIN, TX 78758\n(30.3788756... 2800815 Routine Inspection papa john's pizza #930_2800815 42398 42977 0.309 30.3788756448479, -97.7130952607857 30.378876 -97.713095 1925 RUTLAND DR AUSTIN 1925 RUTLAND DR, AUSTIN, TX 78758
9 Subway #43067 78758 2015-12-21 96 1910 W BRAKER LN\nAUSTIN, TX 78758\n(30.392307... 10091620 Routine Inspection subway #43067_10091620 42398 42977 0.309 30.3923075644652, -97.7066994048772 30.392308 -97.706699 1910 W BRAKER LN AUSTIN 1910 W BRAKER LN, AUSTIN, TX 78758

In [51]:
# How many kinds of inspections are there?

pd.unique(merged['Process_Description'].ravel())


Out[51]:
array(['Routine Inspection', '1st Follow Up to Routine of 50 - 69',
       '2nd Follow Up to 50 - 69', '1st Follow Up to Routine of 49 or less'], dtype=object)

In [52]:
# Delete all columns we don't need anymore:

del merged['Address']
del merged['Coordinates']
del merged['string_id']

In [53]:
# Rearrange remaining columns in a nice way:

merged = merged[['Facility_ID', 'Restaurant_Name', 'Inspection_Date',\
                 'Process_Description', 'Geocode', 'Street', 'City', 'Zip_Code', \
                 'Score', 'Med_Income', 'Population', 'Home_Ownership', 'Latitude', 'Longitude']]

merged.head(3)


Out[53]:
Facility_ID Restaurant_Name Inspection_Date Process_Description Geocode Street City Zip_Code Score Med_Income Population Home_Ownership Latitude Longitude
0 2801996 Mr. Gatti's #118 2015-12-23 Routine Inspection 2121 W PARMER LN, AUSTIN, TX 78758 2121 W PARMER LN AUSTIN 78758 94 42398 42977 0.309 30.415649 -97.697569
1 10385802 Subway 2015-12-23 Routine Inspection 2501 W PARMER LN, AUSTIN, TX 78758 2501 W PARMER LN AUSTIN 78758 98 42398 42977 0.309 30.418236 -97.702242
2 2802274 Baskin Robbins 2015-12-23 Routine Inspection 12407 N MOPAC EXPY, AUSTIN, TX 78758 12407 N MOPAC EXPY AUSTIN 78758 99 42398 42977 0.309 30.417462 -97.702844

In [54]:
# We have 16113 rows with neat addresses which we will use for mapping

len(merged['Geocode'])


Out[54]:
16113

Finally, the interesting visualization I promised. Even though this will be studied in more detail in the next "chapter" of this project (supervised learning), I was wondering whether there is any stark correlation between a restaurant's health inspection score (the response) and each one of the numerics columns I have added to our data: Median Income, Population and Home Ownership percentage (the features).


In [55]:
# Visualize the relationship between the features and the response using scatterplots

fig, axs = plt.subplots(1, 3, sharey=True)

merged.plot(kind='scatter', x='Med_Income', y='Score', ax=axs[0], figsize=(16, 8));
merged.plot(kind='scatter', x='Population', y='Score', ax=axs[1]);
merged.plot(kind='scatter', x='Home_Ownership', y='Score', ax=axs[2]);


At first sight the data looks scattered all over the place in all three plots, and there is possibly no correlation between any of the pairs. However, there are some interesting remarks one could make:

  • As we have already seen, the vast majority of restaurants score over 70 (the passing grade).
  • In areas (Zip Codes) where the median income is relatively lower, it is more probable that there will be restaurants doing worse than average. Notice that when the median income is higher than \$120,000, there is not a single restaurant failing the inspection. However, this might just be a random fluctuation in our data, since there are a few failing restaurants in areas where the median income lies between \$80,000 and \$120,000.
  • Population really doesn't seem to play any role; there is a more or less even distribution of health inspection grades irrespective of the Zip Code's population: both the sparsely and the densely populated areas of town have their share of passing and failing restaurants.
  • The scatterplot of restaurant health inspection scores vs. percentage of home ownership looks very similar to the one of scores vs. median income. We can easily check whether the two quantities are correlated, as we intuitively expect; a quick check is sketched below.
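A quick check of that intuition at the Zip Code level, using the census table and dropping the two Zip Codes without data:

# Correlation between median income and home ownership across the Zip Codes with census data
census = df_numerics.dropna()
income_r, income_p = stats.pearsonr(census['Med_Income'], census['Home_Ownership'])
print("Pearson r = %.2f (p = %.3g)" % (income_r, income_p))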

Next: Supervised Learning - Regression Analysis

Let us save our working DataFrame as a csv file so we can import it into the next Notebook.


In [56]:
merged.to_csv('../data/master.csv', index=False)

We are also going to take a slight detour to the Jupyter Notebook 1a. Mapping Austin Restaurant Scores, where we will get our feet wet with geocoding and mapping. Therefore we need the DataFrame we have been working on so far to be portable to a different notebook.